In [1]:
%matplotlib inline
In [85]:
from IPython.display import Image
import numpy as np
import matplotlib.pyplot as plt
# some classification metrics
# more here:
# http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics
from sklearn.metrics import (auc, roc_curve, roc_auc_score,
                             accuracy_score, precision_score,
                             recall_score, f1_score)
# note: in scikit-learn >= 0.18 train_test_split lives in sklearn.model_selection
from sklearn.cross_validation import train_test_split
# a few classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
In [25]:
# breast cancer dataset, a binary classification task
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(cancer.data,
                                                     cancer.target,
                                                     test_size=0.4,
                                                     random_state=0)
In [43]:
x_train.shape
Out[43]:
In [26]:
# only two classes, 0 and 1
set(y_train)
Out[26]:
In [30]:
clf = LogisticRegression()
clf.fit(x_train, y_train)
Out[30]:
In [46]:
# score by default calculates the accuracy for a classifier
print '%30s: %s' % ('Default score (accuracy)', clf.score(x_train, y_train))
In [34]:
# predict outputs the predicted class label given the dataset features
# this might seem odd given that the model has already seen all of x_train,
# however most of the time models do not fit the training data perfectly,
# which is why you didn't see 100% accuracy above
# as you increase the model complexity (e.g. random forest, neural network, etc.)
# there's a higher likelihood that your model will fit the training data perfectly,
# but as you'll learn this is most likely not a good thing (i.e. overfitting)
predicted_labels = clf.predict(x_train)
predicted_labels[:5]
Out[34]:
In [48]:
# notice that this is the same as what we computed earlier
print '%30s: %s' % ('Accuracy', accuracy_score(y_train, predicted_labels))
In [49]:
# precision is calculated as the ratio of true positives
# over the sum of true positives and false positives
# we'll come back to this later
print '%30s: %s' % ('Precision', precision_score(y_train, predicted_labels))
In [50]:
# recall or sensitivity is the ratio of true positives
# over the sum of true positives and false negatives
# we'll come back to this later
print '%30s: %s' % ('Recall', recall_score(y_train, predicted_labels))
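As a quick sanity check (an added aside, not part of the original walkthrough), we can recover the same precision and recall by hand from scikit-learn's confusion_matrix; for a binary problem with labels 0 and 1 its layout is [[tn, fp], [fn, tp]].
In [ ]:
# added sanity check: recompute precision and recall from the confusion matrix
from sklearn.metrics import confusion_matrix
# rows are true labels, columns are predicted labels: [[tn, fp], [fn, tp]]
tn, fp, fn, tp = confusion_matrix(y_train, predicted_labels).ravel()
print '%30s: %s' % ('Precision (by hand)', tp / float(tp + fp))
print '%30s: %s' % ('Recall (by hand)', tp / float(tp + fn))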
Note: These are VERY good metric values in most cases, which should raise suspicion. The problem in this case is that we're training and calculating scores on the same dataset.
Quiz: Why is training and testing on the same dataset a bad idea?
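To make the quiz concrete, here is an added cell (not in the original) that scores the same fitted model on the held-out test split; comparing these numbers with the training-set numbers above is the whole point of splitting the data.
In [ ]:
# added check: the same fitted model evaluated on the held-out split
test_predicted_labels = clf.predict(x_test)
print '%30s: %s' % ('Test accuracy', accuracy_score(y_test, test_predicted_labels))
print '%30s: %s' % ('Test precision', precision_score(y_test, test_predicted_labels))
print '%30s: %s' % ('Test recall', recall_score(y_test, test_predicted_labels))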
Looking at the roc_auc_score signature from the sklearn documentation:
roc_auc_score(y_true, y_score, average='macro')
One might be tempted to try the following:
In [54]:
print '%30s: %s' % ('AUC (not correct)', roc_auc_score(y_train, predicted_labels))
The problem here is that we're not passing the correct second argument.
y_score has to be:
Target scores, can either be probability estimates of the positive class, confidence values, or non-thresholded measure of decisions (as returned by “decision_function” on some classifiers).
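As an aside (an addition, not something the notebook relies on): the docstring's mention of decision_function means non-thresholded scores work too. LogisticRegression exposes decision_function, and since AUC only depends on how the samples are ranked, this gives essentially the same AUC as the positive-class probabilities computed below.
In [ ]:
# added aside: non-thresholded decision values are also a valid y_score
decision_values = clf.decision_function(x_train)
print '%30s: %s' % ('AUC (decision_function)', roc_auc_score(y_train, decision_values))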
Let's calculate the probability estimates. With most classifiers you can get these using predict_proba.
In [55]:
predicted_probabilities = clf.predict_proba(x_train)
In [56]:
predicted_probabilities[:5]
Out[56]:
For each record in the training dataset predict_proba outputs a probability per class. In our example this takes the form of:
[probability of class 0, probability of class 1]
These two probabilities sum to one. For example, in the first row 4.34552482e-03 + 9.95654475e-01 ~= 1.
In [57]:
predicted_probabilities.shape
Out[57]:
Make sure that all rows sum to 1.
In [76]:
assert all(predicted_probabilities.sum(axis=1) == 1), "At least one row is not summing to one"
Quiz: what is assert? what is all? what is axis=1 doing? why is there no output?
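One caveat worth adding (not raised in the original): comparing floating point sums to exactly 1 can be brittle, so np.allclose is the more defensive version of the same check.
In [ ]:
# added caveat: exact float equality can fail due to rounding; allclose tolerates it
assert np.allclose(predicted_probabilities.sum(axis=1), 1), "At least one row is not summing to one"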
Let's calculate the AUC score by passing the correct argument. We have to pass the probability of the positive class, which corresponds to the 2nd column.
In [72]:
print '%30s: %s' % ('AUC', roc_auc_score(y_train, predicted_probabilities[:, 1]))
Passing the wrong column (the negative class) results in 1 - AUC.
In [74]:
print '%30s: %s' % ('1 - AUC', roc_auc_score(y_train, predicted_probabilities[:, 0]))
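A quick added check that the two quantities really are complements (the two columns induce opposite rankings, so their AUCs sum to one):
In [ ]:
# added check: AUC of the positive column and AUC of the negative column sum to 1
auc_pos = roc_auc_score(y_train, predicted_probabilities[:, 1])
auc_neg = roc_auc_score(y_train, predicted_probabilities[:, 0])
assert np.isclose(auc_pos + auc_neg, 1), "expected the two AUCs to be complements"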
In [77]:
clf = DecisionTreeClassifier()
clf.fit(x_train, y_train)
Out[77]:
In [80]:
predicted_labels = clf.predict(x_train)
print '%30s: %s' % ('Accuracy', accuracy_score(y_train, predicted_labels))
In [78]:
predicted_probabilities = clf.predict_proba(x_train)
print '%30s: %s' % ('AUC', roc_auc_score(y_train, predicted_probabilities[:, 1]))
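This is the overfitting the comments above warned about: an unconstrained decision tree can memorise the training set, so near-perfect training scores say little about how it generalises. Here is an added look at the held-out split for contrast.
In [ ]:
# added check: the same tree evaluated on the held-out split
test_probabilities = clf.predict_proba(x_test)
print '%30s: %s' % ('Test accuracy', clf.score(x_test, y_test))
print '%30s: %s' % ('Test AUC', roc_auc_score(y_test, test_probabilities[:, 1]))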
In fact we can calculate all of these metrics for any classifier (ROC/AUC is the exception, since it needs probability estimates or decision scores, which not every classifier provides). So let's refactor the code and make it more generic.
In [81]:
def classifier_metrics(model):
    clf = model()
    clf.fit(x_train, y_train)
    print '%30s: %s' % ('Default score (accuracy)', clf.score(x_train, y_train))
    predicted_labels = clf.predict(x_train)
    print '%30s: %s' % ('Accuracy', accuracy_score(y_train, predicted_labels))
    print '%30s: %s' % ('Precision', precision_score(y_train, predicted_labels))
    print '%30s: %s' % ('Recall', recall_score(y_train, predicted_labels))
    print '%30s: %s' % ('F1', f1_score(y_train, predicted_labels))
    try:
        predicted_probabilities = clf.predict_proba(x_train)
        print '%30s: %s' % ('AUC', roc_auc_score(y_train, predicted_probabilities[:, 1]))
    except Exception:
        print '*** predict_proba failed for %s' % model.__name__
In [88]:
for model in [LogisticRegression, DecisionTreeClassifier, RandomForestClassifier, SVC]:
    print 'Metrics for %s' % model.__name__
    print '=' * 50
    classifier_metrics(model)
    print '\n'
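The predict_proba failure in the loop comes from SVC, which does not compute probability estimates unless you ask for them. If you want an AUC for it anyway, two options (an added sketch, assuming scikit-learn's usual SVC behaviour): enable probability=True, which fits slower Platt-scaled probabilities, or feed the raw decision_function scores to roc_auc_score.
In [ ]:
# added sketch: two ways to get an AUC out of SVC
svc = SVC(probability=True)  # probability=True enables predict_proba (slower to fit)
svc.fit(x_train, y_train)
print '%30s: %s' % ('AUC (predict_proba)', roc_auc_score(y_train, svc.predict_proba(x_train)[:, 1]))
# the raw decision scores rank the samples just as well
# (.ravel() guards against older versions returning a column vector)
print '%30s: %s' % ('AUC (decision_function)', roc_auc_score(y_train, svc.decision_function(x_train).ravel()))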
Let's say that our "cancer" classifier is predicting the following probabilities for patients A, B, C, D, and E:
Which patients should we call in for further screening?
Before we proceed here's some terminology:
Here are a few different ways to go about it:
This is the essence of the ROC curve. The number you pick as the threshold gives you one point on the ROC curve. Plotting the ROC curve involves changing the threshold from 1 to 0 in small increments and plotting the corresponding points (more details to follow).
In [115]:
# same as before
clf = LogisticRegression()
clf.fit(x_train, y_train)
predicted_probabilities = clf.predict_proba(x_train)
roc_auc = roc_auc_score(y_train, predicted_probabilities[:, 1])
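Before handing things over to roc_curve, here is an added sketch (roc_point is a hypothetical helper, not part of scikit-learn) that computes the single (fpr, tpr) point you get from one hand-picked threshold; sweeping the threshold is what traces out the full curve described above.
In [ ]:
# added sketch: one threshold -> one (fpr, tpr) point on the ROC curve
from sklearn.metrics import confusion_matrix

def roc_point(y_true, positive_probabilities, threshold):
    # call a sample positive when its positive-class probability clears the threshold
    predictions = (positive_probabilities >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, predictions).ravel()
    return fp / float(fp + tn), tp / float(tp + fn)  # (fpr, tpr)

# e.g. the point you get by calling everyone above 0.5 positive
print '(fpr, tpr) at threshold 0.5: (%.3f, %.3f)' % roc_point(y_train, predicted_probabilities[:, 1], 0.5)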
roc_curve generates the coordinates of the ROC curve (fpr and tpr) that are needed for plotting it. It also returns the thresholds used to compute those coordinates.
In [ ]:
fpr, tpr, thresholds = roc_curve(y_train,
                                 predicted_probabilities[:, 1])
Note that the thresholds are not evenly spaced. This has to do with roc_curve dropping redundant coordinates (more on that in the sketch below).
In [129]:
# distance between consecutive thresholds
# x-axis is the diff index, y-axis is the difference
plt.plot(np.abs(np.diff(thresholds)));
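For the curious, an added sketch (assuming a scikit-learn version of at least 0.17, where the flag was introduced): drop_intermediate=True, the default, is what prunes those redundant points; turning it off keeps a threshold for every distinct score.
In [ ]:
# added sketch: compare the default (pruned) thresholds with the unpruned ones
fpr_all, tpr_all, thresholds_all = roc_curve(y_train,
                                             predicted_probabilities[:, 1],
                                             drop_intermediate=False)
print 'thresholds kept by default: %d, without pruning: %d' % (len(thresholds), len(thresholds_all))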
In [108]:
plt.title('Receiver Operating Characteristic (ROC)')
plt.plot(fpr, tpr, 'b', label='AUC = %0.2f' % roc_auc)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([-0.1, 1.2])
plt.ylim([-0.1, 1.2])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.legend(loc='lower right')
plt.show()
Coming soon
Page 197-
Coming soon
https://github.com/Quartz/bad-data-guide/blob/master/README.md
https://www.quora.com/What-are-the-best-ways-to-account-for-missing-data-in-machine-learning
http://diggdata.in/post/90435663721/dealing-with-missing-values-introduction
http://nerds.airbnb.com/overcoming-missing-values-in-a-rfc/ (fill using KNN)
https://www.wikiwand.com/en/Imputation_(statistics)
How does CART deal with missing values?
https://www.quora.com/How-does-XGBoost-treat-missing-values-during-training-and-prediction
https://github.com/dmlc/xgboost/issues/21
https://github.com/scikit-learn-contrib/imbalanced-learn
http://www.fundraisingwithr.com/solutions-for-modeling-imbalanced-data/
http://sebastianraschka.com/blog/2016/model-evaluation-selection-part3.html